1. INTRODUCTION

Parkinson’s disease (PD) is a complex neurological illness classified as a degenerative,
chronic, and progressive disease that affects a person’s movements [1], [2]. Most people are diagnosed
during their 70s, although 15% of cases occur among people who are under 50 years of age. Its prevalence
is estimated at approximately 1.5% among people aged over 65 years [3]. Clinicopathological studies
show that up to 25% of patients with PD are diagnosed incorrectly [4]. The accuracy of clinical diagnosis
can reach approximately 90% within a period of 2 years and 9 months [5]. Diagnosing PD is rather difficult:
to date, there is no blood test that can reveal whether a person has PD. The illness is usually
diagnosed through clinical exams and brain scans. These methods are quite costly, sometimes erroneous, and
require a high level of professional expertise.

Machine learning (ML) is a technique for analyzing data; it automatically learns the information and
behavior of a system and captures complex patterns with ease [6]. Deep learning (DL) is considered
a major evolution of machine learning. It is inspired by the way the brain operates; it uses a programmable
neural network [7] that allows machines to make accurate decisions without human intervention.

A neural model with appropriate generalization can provide precise answers even when tested
with inputs that never appeared in the training set [8]. DL also offers high prediction
performance compared to other ML methods such as support vector machine (SVM) and random forest (RF)
[9]. Recurrent neural networks (RNN) with long short-term memory (LSTM) can learn the temporal correlations
of the input data [10]; an LSTM consists of memory blocks that allow input information to be retained
for a long period [9]. The optimizer is a method for adjusting the parameters of the model.
Optimizing the neural network is very beneficial for increasing the accuracy and reducing the loss. Instead of
mapping inputs to outputs alone, the RNN-LSTM network is capable of learning a mapping function
from inputs to outputs over time. An explicit set of observations need not be pre-specified. The main
contributions of this paper are:

— Proposing an enhanced approach based on deep learning through using RNN-LSTM for early detection 
of PD using voice features. 

— Applying the proposed RNN-LSTM approach with a batch normalization layer after each hidden layer 
to standardize the outputs of the hidden layers. 

— Applying the adaptive moment estimation (ADAM) optimization algorithm for training the network by 
updating the weights of the network iteratively based on the training data while training. 

The rest of this paper is organized as follows: section 2 presents state-of-the-art studies for PD detection,
section 3 describes the phases of the proposed approach, section 4 presents and discusses the obtained
experimental results, and section 5 presents conclusions and future work.


2. RELATED WORK 

Classification techniques based on ML and DL can be a convenient tool for accurate diagnosis,
differentiating healthy people from individuals with PD. Zham et al. [11] used a naive Bayes (NB) algorithm
on handwriting tasks and spiral drawing; different measures were used for each task. The fourth task
achieved the best classification accuracy, 83.2%. Taleb et al. [12] used a feature selection technique on
handwriting tasks based on statistical tests and the SVM classifier. The feature giving the highest
classification performance was picked first. Features were provided separately, one by one, as input to
the SVM classifier. The highest classification accuracy obtained with a single feature was 87.5%. Then,
features were added one after another until 86 features were reached. The best classification accuracy for
a group of features was 96.875% for N=12 features. Drotár et al. [4] compared three classifiers,
K-nearest neighbors (K-NN), an ensemble AdaBoost classifier, and SVM, on the Parkinson’s disease handwriting
(PaHaW) dataset using pressure and kinematic features. SVM obtained the best result of the three
classifiers with an accuracy of 81.3%. Also, Drotár et al. [13] used SVM on handwriting features to classify
PD patients; the accuracy was 88.1% for 162 handwriting features.

Moreover, Drotár et al. [14] used an SVM classifier on the in-air and on-surface
kinematic variables of the handwriting of PD patients. The achieved accuracies were 84% for
in-air movement, 78% for on-surface movement, and 85% for in-air and on-surface movement combined. On the
other hand, Afonso et al. [15] used the optimum-path forest (OPF), deep-hierarchical OPF (dOPF), and
K-means algorithms to identify Parkinson’s disease from handwritten spiral and meander
features; the best result was obtained by the K-means algorithm with an accuracy of 84.17%. Pereira et al. [16]
applied a convolutional neural network (CNN) to spiral and meander hand-drawing features of PD patients; the
accuracy was 87.14% for 128×128 meander images and 77.92% for 128×128 spiral images.

Also, Pereira et al. [17] used three classifiers, NB, OPF, and SVM, on handwritten spiral
drawings; the NB classifier obtained the best result with an accuracy of 78.9%. Heremans et al. [18] used
handwriting features to estimate the quality of writing in PD patients with and without freezing of gait
(FOG). Writing quality was severely affected in patients with FOG. Grover et al. [19]
used a deep neural network (DNN) on UCI’s voice dataset with three layers: input, hidden, and output.
The classification accuracy was 94.4% for training and 62.7% for testing. Saikia et al. [20] used an artificial
neural network to classify PD patients from healthy controls and to identify the different progression
stages of the disease based on electroencephalogram and electromyogram features. The authors of [21] proposed
a model for detecting PD via smell signature, using two sensors to analyze the sweat components
and comparing these components between PD and non-PD individuals. The authors of [22] compared the
classification accuracies of five classifiers, SVM, NB, K-NN, decision tree (DT), and linear discriminant
analysis (LDA), relying on gait dynamics. The average accuracy was 96.8% for the first three classifiers and
93.5% for the last two.

Shinde et al. [23] used the rate of eye blinking per minute to determine parkinsonism: if the
rate is higher than ten blinks per minute, the individual is considered as having PD. To enhance the
detection of patients with PD, in this paper we propose an RNN with LSTM and the ADAM optimizer based on
different voice features. Although LSTM requires some memory, an RNN with LSTM can deal with large
datasets without increasing the size of the model. Also, LSTM is more effective than
traditional time series models because it learns long-term dependencies, using earlier time steps to inform
later ones, so it allows information to persist and achieves better results.

The proposed model overcomes a disadvantage of existing models, namely the limited
datasets and features that seriously affect the accuracy of PD prediction. It also emphasizes the benefit
of accumulation: traditional neural networks that apply direct feedforward processing fall short, whereas
an RNN with LSTM is a loop network that learns long-term dependencies, which enhances the
prediction. Different measures were used to validate the model.


3. RESEARCH METHOD 

The proposed model comprises three main phases: the preprocessing phase, the
optimization phase, and the classification phase. The framework of the proposed model for diagnosing
Parkinson’s disease based on speech features is illustrated in Figure 1.

Figure 1. The proposed approach


The proposed model structure consists of seven layers (an input layer, five hidden layers, and the output
layer). The LSTM input layer contains 27 neurons, one neuron for each feature, and is followed by five LSTM
hidden layers, a 27-neuron dense layer, and a two-neuron dense layer as the output layer. Each LSTM layer is
followed by a dropout layer and a batch normalization layer. The dropout regularizes the input and the
recurrent connections to the LSTM units by excluding some inputs from activation (dropping them out) based
on statistical calculations. The batch normalization layer standardizes the outputs of the hidden layer by
normalizing the values coming from the previous layer. The batch normalization layer reduces overfitting, as
it has a slight regularization effect that improves the performance of the model.

Finally, the 27-neuron dense layer is followed by a fully connected dense layer, in which all neurons of the
previous layer are connected to that layer; this last dense layer works as the output layer. The following
subsections illustrate the details of each phase.


3.1. Preprocessing phase 

This phase collects and prepares the data for the following phases to improve the results
and suppress the effect of outliers. Min-max normalization was applied so that every feature has the
same range of values and is therefore equally important. This is done via (1). This process yields
small standard deviations, which can suppress the effect of outliers.


$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} \qquad (1)$$
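
The normalization in (1) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names and the toy data are hypothetical, each feature column is assumed to be scaled independently, and a constant column is mapped to zeros to avoid division by zero.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the range [0, 1] following equation (1)."""
    x_min, x_max = min(column), max(column)
    span = x_max - x_min
    if span == 0:
        # Constant feature: map every value to 0.0 to avoid dividing by zero.
        return [0.0 for _ in column]
    return [(x - x_min) / span for x in column]


def normalize_features(rows):
    """Apply min-max normalization independently to every feature column."""
    columns = list(zip(*rows))                      # rows -> columns
    scaled = [min_max_normalize(col) for col in columns]
    return [list(row) for row in zip(*scaled)]      # columns -> rows


# Hypothetical toy data: three samples with two features each.
samples = [[2.0, 100.0], [4.0, 300.0], [6.0, 200.0]]
normalized = normalize_features(samples)
print(normalized)
```

After scaling, both features span the same interval, so neither dominates the other only because of its units.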

3.2. Optimization phase 

The main goal of deep learning and machine learning is to reduce the difference between the actual
output and the predicted output, which is measured by the cost function or loss function. Optimization
is needed while training the neural network to ensure adequate generalization of the algorithm and to
minimize the cost function by finding the optimized values of the weights. This yields better predictions
for data that was not seen before.

In the proposed model, two different optimizers were used: the commonly known SGD optimizer
and the ADAM optimizer, which is widely used for deep learning models. The ADAM optimizer
achieved the best performance, as shown in subsection 3.3.4. The ADAM optimizer [24],
[25] is one of the most recommended optimization techniques; it essentially combines the advantages of
the stochastic gradient descent (SGD) with momentum algorithm and the root mean square propagation
(RMSProp) algorithm. The advantages of ADAM can be summarized in the following points:

— The ADAM algorithm doesn’t need high memory requirements. 

— The ADAM algorithm makes use of the average of the second moments of the gradients not only adapting 
the learning rates based on the average of the first moments. The first moment is mean, and the second 
moment is uncentered variance. 

— The ADAM algorithm works very well even with a little regulation of hyperparameters. 

The ADAM optimizer works according to the following steps: 

a. Initialize the 1st moment $m_0 = 0$, the 2nd moment $n_0 = 0$, and the time step $t = 0$.
b. Update the biased estimates of the 1st and 2nd moments, as shown in (2), (3).

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, dw \qquad (2)$$

$$n_t = \beta_2 n_{t-1} + (1 - \beta_2)\, dw^2 \qquad (3)$$

c. Calculate the bias-corrected 1st and 2nd moments, as shown in (4), (5).

$$m_t^{corr} = \frac{m_t}{1 - \beta_1^t} \qquad (4)$$

$$n_t^{corr} = \frac{n_t}{1 - \beta_2^t} \qquad (5)$$

d. Update the parameters P and S, see (6), (7).

$$P = P - \alpha \frac{m_t^{corr}}{\sqrt{n_t^{corr}} + \epsilon} \qquad (6)$$

$$S = S - \alpha \frac{m_t^{corr}}{\sqrt{n_t^{corr}} + \epsilon} \qquad (7)$$

Here, $\beta_1$ and $\beta_2$ are hyperparameters with default values of 0.9 and 0.999 respectively, $dw$ is the
gradient, $\alpha$ is the learning rate, and $\epsilon$ is a small constant (typically $10^{-8}$) that prevents
division by zero. The ADAM optimizer is shown in Figure 2.
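
Steps a to d can be sketched for a single scalar parameter as follows. This is a minimal illustration, not the training code used in the experiments: the function name `adam_minimize`, the quadratic objective, and the step count are assumptions chosen only to make the example self-contained, and the time step is incremented before each update so that the bias correction in (4) and (5) is well defined.

```python
import math


def adam_minimize(grad, w, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a scalar function with ADAM, given its gradient function."""
    m = 0.0  # 1st moment (step a)
    n = 0.0  # 2nd moment (step a)
    for t in range(1, steps + 1):
        dw = grad(w)
        m = beta1 * m + (1 - beta1) * dw           # equation (2)
        n = beta2 * n + (1 - beta2) * dw * dw      # equation (3)
        m_corr = m / (1 - beta1 ** t)              # equation (4)
        n_corr = n / (1 - beta2 ** t)              # equation (5)
        w = w - alpha * m_corr / (math.sqrt(n_corr) + eps)  # equation (6)
    return w


# Hypothetical objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0, alpha=0.1, steps=500)
print(round(w_final, 3))
```

In a network, the same update is applied element-wise to every weight and bias, each with its own pair of moment estimates.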


3.3. Classification phase 

The proposed model applies an RNN with LSTM to distinguish healthy individuals from PD patients
and uses the ADAM optimizer to update the weights of the network iteratively, as detailed in
the next subsections.

Figure 2. ADAM optimizer


3.3.1. Recurrent neural networks 

An RNN is a generalization of a feedforward neural network that contains an internal memory. In an RNN,
the output for the current input depends on the prior computation. After the output is produced, it is copied
and sent back into the recurrent network. To make a decision, RNNs use the internal memory to operate on a
series of inputs in which all the inputs are related to each other.
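
The feedback described above can be illustrated with a single recurrent unit. This is a hedged sketch: the weights `w_in` and `w_rec`, the tanh activation, and the input sequence are hypothetical values picked for the example and are not taken from the proposed model.

```python
import math


def rnn_forward(inputs, w_in=0.5, w_rec=0.8, bias=0.0):
    """Run a one-unit recurrent network over a sequence and return all hidden states."""
    hidden = 0.0  # internal memory, empty before the first input
    states = []
    for x in inputs:
        # The new state depends on the current input and on the previous state.
        hidden = math.tanh(w_in * x + w_rec * hidden + bias)
        states.append(hidden)
    return states


# Only the first input is non-zero, yet later states are still non-zero
# because the previous output is fed back into the unit.
states = rnn_forward([1.0, 0.0, 0.0])
print(states)
```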


3.3.2. Long short-term memory 

LSTM uses back-propagation for training. An LSTM network has three main gates: the input gate, the forget
gate, and the output gate. The input gate uses a sigmoid function to decide which values from the input are
let through to modify the memory. The forget gate determines which details from the previous state can
be discarded from the block. Finally, the output gate controls the output.
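
A single-unit LSTM step makes the role of the three gates concrete. The sketch below assumes the commonly used LSTM formulation with a tanh candidate value; the parameter layout, the function names, and the weights are hypothetical and only serve the illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM cell.

    params maps a gate name to a tuple (input weight, recurrent weight, bias).
    """
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    i = gate("input", sigmoid)         # which new values may modify the memory
    f = gate("forget", sigmoid)        # how much of the previous cell state is kept
    o = gate("output", sigmoid)        # how much of the memory is exposed as output
    g = gate("candidate", math.tanh)   # candidate values for the memory
    c = f * c_prev + i * g             # updated cell state
    h = o * math.tanh(c)               # updated hidden state (the output)
    return h, c


# Hypothetical weights: input gate closed, forget and output gates open,
# so the cell simply carries its previous memory forward.
keep = {
    "input": (0.0, 0.0, -50.0),
    "forget": (0.0, 0.0, 50.0),
    "output": (0.0, 0.0, 50.0),
    "candidate": (1.0, 0.0, 0.0),
}
h_keep, c_keep = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=keep)
print(h_keep, c_keep)
```

With the forget gate open and the input gate closed, the cell state passes through unchanged, which is the mechanism that lets an LSTM retain information over many time steps.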


3.3.3. Regularization with dropout 

In general, the most common problem that neural network models suffer from is overfitting.
Overfitting means that the model performs well on the training dataset but does
not perform well on the test dataset. To overcome this problem, the proposed model applies the
dropout regularization technique. The dropout is carried out in both the training and testing states. The
dropout parameter value used was 0.2.
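
The effect of a dropout rate of 0.2 can be sketched as a random mask over a layer's activations. This example assumes the common "inverted dropout" variant, in which surviving activations are rescaled by 1/(1 - rate); the function name, the seed, and the all-ones activations are hypothetical.

```python
import random


def dropout(values, rate=0.2, rng=None):
    """Zero each value with probability `rate` and rescale the survivors."""
    if rng is None:
        rng = random.Random()
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)                 # fixed seed so the sketch is repeatable
activations = [1.0] * 10000            # hypothetical layer output
dropped = dropout(activations, rate=0.2, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
print(round(zero_fraction, 2))
```

Roughly one fifth of the activations are removed on each pass, so the network cannot rely on any single unit.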


3.3.4. The recurrent neural networks model with ADAM optimizer

The RNN model comprises an input layer, followed by five LSTM hidden layers, and the last
layer is the output layer. This subsection elaborates on how the ADAM optimizer is applied to the proposed
recurrent neural network model. The dataset is loaded and all the data is normalized into
values between 0 and 1. The training data is processed with a batch size of 104 sample records for 10 epochs.
The model is compiled with the ADAM optimizer, which updates the weights of the network
iteratively, using the sparse_categorical_crossentropy loss function with learning rate=0.001 and decay=1e-4.
The network structure of the proposed model is shown in Table 1.


Table 1. Network structure
Description                    Value
Number of network layers       7
Number of hidden layers        5
Learning rate                  0.01
Decay                          1e-4
Batch size                     104
Number of epochs               10
Loss function                  sparse_categorical_crossentropy
Activation function            SoftMax
Number of training samples     1040
Number of test samples         168
Optimization                   ADAM optimizer
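
The interplay between batch size, epochs, and decay can be worked through numerically. The sketch below assumes that `decay` follows the time-based schedule of the legacy Keras optimizers, lr / (1 + decay * iterations); this is an assumption about the implementation, and the helper names are hypothetical. It uses the learning rate of 0.001 stated in the text.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of weight updates needed to see every training sample once."""
    return math.ceil(n_samples / batch_size)


def decayed_learning_rate(initial_lr, decay, iteration):
    """Time-based decay (assumed schedule): lr / (1 + decay * iteration)."""
    return initial_lr / (1.0 + decay * iteration)


updates_per_epoch = steps_per_epoch(1040, 104)     # 1040 samples, batches of 104
total_updates = updates_per_epoch * 10             # 10 epochs
final_lr = decayed_learning_rate(0.001, 1e-4, total_updates)
print(updates_per_epoch, total_updates, final_lr)
```

Under these assumptions the learning rate shrinks by only about 1% over the whole run, so the decay acts as a mild stabilizer and not as an aggressive schedule.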
Table 2 shows the performance of the proposed model with the ADAM optimizer and the performance of the
typical RNN (RNN with the stochastic gradient descent (SGD) optimizer). As Table 2 shows, the ADAM optimizer
improved the accuracy of the proposed model by approximately 15.6 percentage points over the typical RNN.


Table 2. Performance of the RNN with ADAM and SGD optimizers
Measurements    RNN with ADAM    RNN with SGD
Accuracy        95.8%            80.2%
Recall          100%             78.8%
Precision       92.3%            87.8%
F-score         96%              83.05%





4. EXPERIMENTAL RESULTS AND DISCUSSION 

In this section, we discuss the obtained results by presenting the datasets used, with brief details
about the features of each dataset, the experimental settings, and the measures used to validate the model
performance. We also compare the proposed model with the model presented by Grover
et al. [19], which addresses the same problem, in terms of accuracy and the structure of the two
models. Moreover, we examine the accuracies and some validation measures of the different ML algorithms
that we applied to the two datasets, namely RNN with ADAM optimizer, RNN with SGD, SVM, and K-NN,
in order to highlight the best model for detecting PD. Finally, we show a performance comparison between
the proposed model and other related works.


4.1. Datasets and experimental setting 

In our experiments, we work with the Python programming language along with the TensorFlow and Keras
libraries. The proposed model implements an RNN with LSTM along with the ADAM optimizer and a
sparse_categorical_crossentropy loss function. We also consider the model presented in [19], which used a
feed-forward neural network with three hidden layers. Two benchmark datasets of speech features are used in
this study. The first PD dataset (DS1) is the Parkinson’s telemonitoring voice dataset from the UCI public
repository of datasets [26]. This dataset consists of 1040 samples for training and 168 samples for testing
with 27 voice features.

The second dataset (DS2) was created by Max Little of the University of Oxford in collaboration with
the National Centre for Voice and Speech. This dataset contains 195 samples, 130 for training and 65
for testing, with 22 voice features [27]. When applying the second dataset, we changed the number
of neurons in the hidden layers of the network to 22, matching the number of voice features,
and kept the same network structure. Details of the features of both datasets are listed in Table 3.


4.2. Results 

We used different measures to validate our model: accuracy, recall, precision,
and F-score. They are computed from the numbers of true positives (TP), true negatives (TN), false
positives (FP), and false negatives (FN) as shown in (8)-(11).


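$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (8)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (9)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (10)$$

$$\text{F-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (11)$$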
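
Equations (8)-(11) translate directly into code. In the sketch below the function name is hypothetical, and the confusion-matrix counts are illustrative values for a 168-sample test set, not counts reported in this paper; the sketch also assumes that none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision, and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)                 # equation (8)
    recall = tp / (tp + fn)                                    # equation (9)
    precision = tp / (tp + fp)                                 # equation (10)
    f_score = 2 * precision * recall / (precision + recall)    # equation (11)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_score": f_score,
    }


# Hypothetical counts for illustration only (84 + 77 + 7 + 0 = 168 samples).
metrics = classification_metrics(tp=84, tn=77, fp=7, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With no false negatives the recall is 1, while the seven false positives lower both precision and accuracy.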


The accuracy of a model measures how well the model classifies the data. It is the
ratio of the correctly predicted samples to the total number of predicted samples. Precision is the
ratio of the samples correctly predicted as positive to all samples predicted as positive; in other words,
precision indicates how many of the predicted PD patients actually have PD. Recall measures how well the
model identifies true positives; in the proposed model, the recall shows how many PD patients are correctly
predicted. F-score is the harmonic mean of the recall and precision. The classification accuracy obtained by
our model on the first dataset was 95.8%, compared with 62.7% for the methodology proposed by Grover et al.
[19]. This shows that our proposed model improves the classification accuracy by 33.1 percentage points over
the methodology presented in [19]. Table 4 presents a brief comparison between the structure of the two
models and the accuracy performance of each model.


Table 3. Datasets features

Data features for (DS1)
Feature                                Feature description
Jitter (local)                         Several measures of variation in fundamental frequency
Jitter (local, absolute)               Several measures of variation in fundamental frequency
Jitter (rap)                           Several measures of variation in fundamental frequency
Jitter (ppq5)                          Several measures of variation in fundamental frequency
Jitter (ddp)                           Several measures of variation in fundamental frequency
Shimmer (local)                        Several measures of variation in amplitude
Shimmer (local, dB)                    Several measures of variation in amplitude
Shimmer (apq3)                         Several measures of variation in amplitude
Shimmer (apq5)                         Several measures of variation in amplitude
Shimmer (apq11)                        Several measures of variation in amplitude
Shimmer (dda)                          Several measures of variation in amplitude
AC                                     Harmonicity parameters
NTH                                    Harmonicity parameters
HTN                                    Harmonicity parameters
Median pitch                           Pitch parameters
Mean pitch                             Pitch parameters
Standard deviation                     Pitch parameters
Minimum pitch                          Pitch parameters
Maximum pitch                          Pitch parameters
Number of pulses                       Pulse parameters
Number of periods                      Pulse parameters
Mean period                            Pulse parameters
Standard deviation of period           Pulse parameters
Fraction of locally unvoiced frames    Voicing parameters
Number of voice breaks                 Voicing parameters
Degree of voice breaks                 Voicing parameters
UPDRS                                  Voicing parameters

Data features for (DS2)
Feature               Feature description
MDVP:Fo (Hz)          Average vocal fundamental frequency
MDVP:Flo (Hz)         Minimum vocal fundamental frequency
MDVP:Fhi (Hz)         Maximum vocal fundamental frequency
MDVP:Jitter (Abs)     Several measures of variation in fundamental frequency
MDVP:RAP              Several measures of variation in fundamental frequency
MDVP:RAP              Several measures of variation in fundamental frequency
MDVP:Jitter (%)       Several measures of variation in fundamental frequency
MDVP:PPQ              Several measures of variation in fundamental frequency
Jitter:DDP            Several measures of variation in fundamental frequency
MDVP:Shimmer          Several measures of variation in amplitude
MDVP:Shimmer (dB)     Several measures of variation in amplitude
Shimmer:APQ5          Several measures of variation in amplitude
MDVP:APQ              Several measures of variation in amplitude
Shimmer:DDA           Several measures of variation in amplitude
NHR                   Measures of ratio of noise to tonal components in the voice
HNR                   Measures of ratio of noise to tonal components in the voice
RPDE                  Nonlinear dynamical complexity measures
D2                    Nonlinear dynamical complexity measures
spread1               Nonlinear measures of fundamental frequency variation
spread2               Nonlinear measures of fundamental frequency variation
PPE                   Nonlinear measures of fundamental frequency variation
DFA                   Signal fractal scaling exponent





As Table 4 shows, the proposed approach achieved a higher accuracy than the approach of Grover et al. [19]
by approximately 33 percentage points. Different ML algorithms were applied to find the best model for
predicting the possibility of having Parkinson’s disease: RNN with ADAM optimizer, RNN with
SGD, SVM, and K-NN. The accuracies of the applied models on the two voice datasets (DS1) and (DS2) are
shown in Figure 3.


Table 4. DNN and RNN comparison
Neural network                        DNN                            RNN with ADAM optimizer
Number of hidden layers               3                              5
Type of neural network                Feed forward neural network    RNN
Memory                                -                              LSTM
Data normalization                    Min-max normalization          Min-max normalization
Optimizer                             -                              ADAM optimizer
Loss function                         -                              sparse_categorical_crossentropy
Number of neurons in hidden layers    10,20,10                       27
Measurements                          Accuracy                       accuracy, recall, precision, F-score
Testing accuracy                      62.7335%                       95.8%

[Bar chart: accuracy (%) of RNN with ADAM, RNN with SGD, SVM, and K-NN on (DS1) and (DS2)]

Figure 3. Average accuracies of the different models


Figure 3 shows that, on the first dataset (DS1), the RNN model with ADAM optimizer increased the
classification accuracy by 15.6 percentage points compared with the RNN with SGD, by 5.8 percentage points
compared with the SVM algorithm, and by 1.9 percentage points compared with K-NN.

Also, Figure 3 illustrates that the RNN model with ADAM optimizer maintained the best
accuracy on the second dataset (DS2), with differences of 9.7, 7.4, and 10.7 percentage points versus the
RNN with SGD, SVM, and K-NN models respectively. These results show that the RNN model
with ADAM optimizer achieved the best classification result on both voice datasets. Table 5 shows the
performance of these models on the two datasets based on the recall, precision, and F-score.


Table 5. Validation measurements of the different models on (DS1) and (DS2)
ML Algorithm     Dataset    Recall    Precision    F-score
RNN with ADAM    (DS1)      100%      92.3%        96%
                 (DS2)      99%       82.2%        90.24%
SVM              (DS1)      85%       100%         92%
                 (DS2)      74%       100%         85%
K-NN             (DS1)      92%       99%          95%
                 (DS2)      71%       100%         83%
RNN with SGD     (DS1)      78.8%     87.8%        83.05%
                 (DS2)      79.2%     72.5%        75.35%





As Table 5 shows, the RNN model with ADAM optimizer achieved high results on both
datasets across the various validation measures. These results highlight the benefits of the LSTM along with
the ADAM optimizer.

The lower performance of the different models on the second dataset (DS2) could be
due to its small number of samples in comparison to the first dataset (DS1). Table 6 compares
the performance of previously surveyed studies, which use different models and datasets, with the
performance of the proposed approach for detecting PD.


Table 6. Different models performance comparison
Work                 Classifier       Dataset         Accuracy    Recall    Precision
[11]                 NB               Hand Writing    83%         83.2%     83.2%
[14]                 SVM              Hand Writing    85%         85.2%     85.9%
[17]                 NB               Hand Writing    78.9%       91%       24%
[19]                 DNN              Voice           62.7%       -         -
Proposed approach    RNN with ADAM    Voice (DS1)     95.8%       100%      92.3%
                                      Voice (DS2)     82.2%       99%       82.2%





Moreover, the Matthews correlation coefficient (MCC) of the proposed model on the first dataset
(DS1) was calculated, giving 92.04%. MCC considers all the TP, FP, TN, and FN values, and a high
MCC value (close to 1) means that both classes were properly predicted, even when one of the two
classes is disproportionately represented. MCC can be calculated from (12).


$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (12)$$
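
A short sketch shows how (12) can be evaluated. The function name and the counts are hypothetical and are not the confusion matrix of the proposed model; returning 0 when the denominator vanishes is a common convention assumed here.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient following equation (12)."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: MCC is set to 0 when a marginal total is zero.
        return 0.0
    return numerator / denominator


# Hypothetical counts chosen only to illustrate the calculation.
mcc = matthews_corrcoef(tp=50, tn=40, fp=5, fn=5)
print(round(mcc, 4))
```

Because all four counts enter the formula, a classifier that ignores the minority class scores low on MCC even when its accuracy looks high.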


The elapsed time for the whole process was 20 minutes with 104 epochs. Each epoch takes 
approximately 11 seconds. 


5. CONCLUSION 

In this paper, we presented a model that aims to diagnose Parkinson’s disease with less human
intervention and in a much cheaper and more efficient way. An RNN with LSTM and the ADAM optimizer was
used with the sparse_categorical_crossentropy loss function and the SoftMax activation function. The model was
applied to two different voice datasets, and multiple measures were computed to evaluate the model
performance. On the first dataset, the achieved accuracy is 95.8%, the recall is 100%, the precision is 92.3%,
and the F-score is 96%. On the second dataset, the proposed approach obtained an accuracy of 82.2%, a recall
of 99%, a precision of 82.2%, and an F-score of 90.24%. For future work, we will consider more
voice features together with other kinematic features such as handwriting features.